DS6040: Homework 1: Probability Review and Priors

Diana McSpadden (hdm5s)

Q1 (15 points):

You are a data scientist and are choosing between three approaches, A, B, and C, to a problem.

You are equally likely to choose among unselected options.

What is the expected time in days for you to obtain the results you are looking for? What is the variance on this time?

Answer:

Q2 (15 points)

Suppose whether it is sunny or not in Charlottesville depends on the weather of the last three days. Show how this can be modeled as a Markov chain.

Answer:

From what I read, the Markov property says that "only the most recent point in the trajectory affects what happens next" (https://www.stat.auckland.ac.nz/~fewster/325/notes/ch8.pdf)

We can create a transition matrix for Sunny Today to Sunny Tomorrow, where I am stating that there is an 80% probability it is sunny tomorrow if it is sunny today, and a 60% chance it is sunny tomorrow if it is not sunny today.

                   Sunny Tomorrow   Not Sunny Tomorrow
Sunny Today        0.8              0.2
Not Sunny Today    0.6              0.4

The question does not ask, but given that today is sunny, the probability that the next 3 days are all sunny would be 0.8^3 = 0.512.

This question does ask for a multi-step transition diagram and table:

What are the 8 possible sequences of S and R for 3 days?
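One way to make the three-day dependence Markov is to treat the weather of the last three days as a single state. The sketch below only enumerates those states and the transitions that are structurally possible; it assigns no probabilities, and the helper name `successors` is my own, not from the assignment.

```python
from itertools import product

# Each state is the weather on the last three days, oldest day first
# (S = sunny, R = not sunny).
states = ["".join(days) for days in product("SR", repeat=3)]

def successors(state):
    """States reachable in one day: drop the oldest day, append tomorrow's weather."""
    return [state[1:] + tomorrow for tomorrow in "SR"]

for state in states:
    print(state, "->", successors(state))
```

Each of the 8 states has exactly two possible successors, so the full 8 x 8 transition matrix has two nonzero entries per row.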

Here are my drawings of single step and the continual cycles from each state to itself and the other states:

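[Figure: hand-drawn transition diagrams for the single step and for the cycles from each state to itself and the other states]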

Question 3 (15 points)

Assume a Gaussian distribution for observations, Xi, i = 1, . . . , N with unknown mean, M, and known variance 5.

Suppose the prior for M is Gaussian with variance 10.

How large a random sample must be taken (i.e., what is the minimum value for N) to specify an interval having unit length of 1 such that the probability that M lies in this interval is 0.95?

Answer:

What is minimum value of N?

Work:

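Using Bayes' theorem (posterior = likelihood * prior / evidence) with a Gaussian likelihood of known variance sigma^2 and a Gaussian prior of variance sigma_prior^2, the posterior variance is:

sigma_prior^2 * sigma^2 / (sigma^2 + N * sigma_prior^2) =

(10 * 5) / (5 + 10N) =

50 / (5 + 10N)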

Because we are using a normal/Gaussian distribution, our 95% confidence range is [-1.96 sigma, 1.96 sigma] around the posterior mean, where sigma is the posterior standard deviation.

But we want this range to be unit length 1, so we need 1.96 * sigma to be 0.5

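sigma = square root(50 / (5 + 10N))

sigma needs to be at most 0.5 / 1.96 = 0.2551

solving for N:

50 / (5 + 10N) <= (0.5 / 1.96)^2 = 0.0651

5 + 10N >= 50 * (1.96 / 0.5)^2 = 768.32

N >= 76.33

N = 77 random samples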
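The requirement can also be checked numerically. This is a minimal sketch that assumes the posterior-variance expression from the work above and the 1.96 normal quantile; the variable names are mine.

```python
import math

sigma2 = 5.0       # known observation variance (from the question)
prior_var = 10.0   # prior variance on the mean M (from the question)
z = 1.96           # two-sided 95% normal quantile, as used in the work above
half_width = 0.5   # half of the unit-length interval

def posterior_var(n):
    """Posterior variance of M after n observations (Gaussian, known variance)."""
    return sigma2 * prior_var / (sigma2 + n * prior_var)

# We need z * sqrt(posterior_var(n)) <= half_width; solve for n and round up.
target_var = (half_width / z) ** 2
n_exact = (sigma2 * prior_var / target_var - sigma2) / prior_var
n_min = math.ceil(n_exact)

print("exact N:", n_exact, "minimum whole N:", n_min)
print("interval length at n_min:", 2 * z * math.sqrt(posterior_var(n_min)))
```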

Question 4 (15 points)

You have started an online business selling books that are of interest to your customers. A publisher has just given you a large book with photos from famous 20th century photographers. You think this book will appeal to people who have bought art books, history books and coffee table books. In an initial offering of the new book you collect data on purchases of the new book and combine these data with data from the past purchases (see ArtHistBooks.csv).

Use Bayesian analysis to give the posterior probabilities for purchases of art books, history books and coffee table books, as well as, the separate probabilities for purchases of the new book given each possible combination of prior purchases of art books, history books and coffee table books.

Do this by first using beta priors with values of the hyperparameters that represent lack of prior information.

Then compute these probabilities again with beta priors that show strong weighting for low likelihood of a book purchase. Compare your results.

Answer:

Posterior Probabilities For:

Posterior Probabilities For:

Posterior Probability For:

Posterior Probabilities for Purchasing New Book for all combinations with an informed prior indicating a low likelihood of purchasing the new book:

I used alpha = 10, and beta = 110 to represent a mean of 8.3%.
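To show how the two priors pull the posterior differently, here is a small sketch of the beta-binomial conjugate update. The purchase counts are hypothetical (they are not taken from ArtHistBooks.csv), and Beta(1, 1) is my assumption for the "lack of prior information" prior.

```python
def beta_posterior(alpha, beta, purchases, customers):
    """Conjugate update: Beta(alpha, beta) prior with a binomial likelihood."""
    return alpha + purchases, beta + customers - purchases

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical counts for one subgroup of customers (illustration only).
purchases, customers = 30, 200

flat = beta_posterior(1, 1, purchases, customers)          # assumed uninformative prior
skeptical = beta_posterior(10, 110, purchases, customers)  # prior mean 10/120 = 8.3%

print("uninformative prior -> posterior", flat, "mean", beta_mean(*flat))
print("low-purchase prior  -> posterior", skeptical, "mean", beta_mean(*skeptical))
```

With the Beta(10, 110) prior the posterior mean is pulled toward 8.3%, and the pull is stronger for subgroups with fewer customers.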

Question 4 Comments: From both the confidence intervals and posterior probability distributions one can see that the probability of purchasing the new book increases as the subpopulation of book purchasers becomes more targeted. The CI of the probability has the highest upper bound for customers who purchased art, history, and coffee table books.

The change in probability from the prior to the posterior using Bayes' theorem is also of interest, as the likelihood function takes into account the differing subpopulation behaviors. From the plots and CIs using informed priors, purchasing both art and history books appears to have the greatest predictive impact on the purchasing of the new book, with the smallest CI range and almost as large a maximum probability in the CI.

Question 5 (15 points)

The data set CHDdata.csv contains cases of coronary heart disease (CHD) and variables associated with the patient’s condition: systolic blood pressure, yearly tobacco use (in kg), low density lipoprotein (ldl), adiposity, family history (0 or 1), type A personality score (typea), obesity (body mass index), alcohol use, age, and the diagnosis of CHD (0 or 1).

Perform a Bayesian analysis of these data that finds the posterior marginal probability distributions for the means for the data of patients with and without CHD.

You should first standard scale (subtract the mean and divide by the standard deviation) all the numeric variables (remove family history and do not scale CHD). Then separate the data into two sets, one for patients with CHD and one for patients without CHD.

Your priors for both groups should assume means of 0 for all variables and a correlation of 0 between all pairs of variables. You should assume all variances for the variables are 1.

Use a prior alpha equal to one plus the number of predictor variables.

Compute and compare the Bayesian estimates for the posterior means for each group.

An example of the question is: how confident are we that tobacco use is higher for patients with CHD?

We need to compute the posterior CI.

We are given information about the priors: the prior mean and standard deviation.

Answer:

Split into chd and no chd

For each column I want to calculate the "posterior marginal probability distributions for the means for the data of patients with and without CHD."

Without CHD:

With CHD

I wanted an idea of the distribution spread for each of the variables before I selected distributions to use:

To model the continuous mean for each of the 8 predictor variables I need to choose a distribution.

Based on the distribution plots above I feel that a Gaussian with Known Variance distribution is appropriate for several of the variables: sbp, ldl, adiposity, typea, and obesity.

First, I will use the Gaussian_Known_Variance distribution for all the predictors and analyze the means in the CHD = 0 and CHD = 1 subpopulations to perform what I am thinking of as a "Bayesian 2-sample t-test" to decide if the means are significantly different in the subpopulations, and thus determine how confident we are that the predictor contributes to CHD.

After using the Gaussian Known Variance distribution for all predictors I will also calculate posterior mean and scale using the Gaussian Likelihood with Unknown Mean and Unknown Precision using calculations provided in Module 2.

First, Using Gaussian Known Variance

Gaussian Known Variance Analysis of Means for Predictors in CHD = 0 & CHD = 1

CHD predictor   CHD = 0 posterior mu CI (width)   CHD = 1 posterior mu CI (width)   Contains 0 Y/N
sbp             -0.24 to -0.03 (0.21)             0.10 to 0.42 (0.32)               CHD=0: N, CHD=1: N
tobacco         -0.32 to -0.10 (0.22)             0.24 to 0.58 (0.34)               CHD=0: N, CHD=1: N
ldl             -0.30 to -0.08 (0.22)             0.20 to 0.52 (0.32)               CHD=0: N, CHD=1: N
adiposity       -0.30 to -0.07 (0.23)             0.20 to 0.49 (0.29)               CHD=0: N, CHD=1: N
typea           -0.19 to 0.04 (0.23)              -0.02 to 0.30 (0.32)              CHD=0: Y, CHD=1: Y
obesity         -0.18 to 0.04 (0.22)              -0.02 to 0.29 (0.31)              CHD=0: Y, CHD=1: Y
alcohol         -0.16 to 0.07 (0.23)              -0.07 to 0.25 (0.32)              CHD=0: Y, CHD=1: Y
age             -0.38 to -0.16 (0.22)             0.38 to 0.64 (0.26)               CHD=0: N, CHD=1: N

The table above demonstrates that we are more certain about the CHD=0 means (smaller variance), which makes more sense because we have more observations in the CHD=0 data set (302 vs 160).
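The effect of group size on the interval width can be sketched with the Gaussian known-variance conjugate update. This is an illustration, not the plotting function used for the table: it assumes a data variance of 1 in each group and the prior mean 0 and prior variance 1 from the question, and the sample means are hypothetical placeholders.

```python
import math

def gaussian_known_variance_posterior(xbar, n, sigma2=1.0, mu0=0.0, tau2=1.0):
    """Posterior mean and variance of mu for a N(mu0, tau2) prior and known data variance sigma2."""
    precision = 1.0 / tau2 + n / sigma2
    post_mean = (mu0 / tau2 + n * xbar / sigma2) / precision
    return post_mean, 1.0 / precision

def interval_95(mean, var):
    """Central 95% interval of a normal distribution with the given mean and variance."""
    half = 1.96 * math.sqrt(var)
    return mean - half, mean + half

# Group sizes are from the text; the xbar values are hypothetical.
widths = {}
for label, xbar, n in [("CHD=0", -0.14, 302), ("CHD=1", 0.26, 160)]:
    low, high = interval_95(*gaussian_known_variance_posterior(xbar, n))
    widths[label] = high - low
    print(label, round(low, 2), round(high, 2), "width", round(high - low, 2))
```

Under these assumptions the width depends only on the number of observations, so the larger CHD = 0 group always gets the narrower interval.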

The typea, obesity and alcohol 95% CIs for the means span 0, indicating that the differences in the means may be due to randomness in the data and are not significant; however, with so few observations in either data set, and because most of each range sits on the negative side (for CHD = 0) and the positive side (for CHD = 1), I would want to see more data to definitively say that typea, obesity and alcohol do not have different mean values in the CHD = 0 and CHD = 1 data sets.

For sbp, tobacco, ldl, adiposity, and age the confidence intervals are either entirely negative (CHD = 0) or entirely positive (CHD = 1), indicating that these mean differences are not due to randomness in sampling.

Now, with Gaussian Unknown Mean and Unknown Precision

I would now like to attempt to produce posterior means and scale using Likelihood Gaussian with Unknown Mean and Unknown Precision.

I am trying this with just the calculations instead of the plotting function used above just to try doing this in a different way.

I am using this slide for setting my parameters:

[Figure: course slide used to set the parameters]

To calculate the posterior mean I need to identify/calculate:

  1. tau_prior = weight given to the prior mean relative to the sample mean
    • number between 0 and N.
    • I will use 2 because I have very few samples compared to a population
  2. mu_prior
    • 0 per the question definition
  3. N = number of observations
    • CHD=0: 302
    • CHD=1: 160
  4. mu = from the data for each predictor

To calculate the posterior precision I will need to identify/calculate:

  1. alpha_prior & beta_prior
    • I am setting these to the parameters of a gamma distribution that is as close to uninformative as I have been able to make it:
      • alpha_prior: 0.001
      • beta_prior: 0.001
  2. N = number of observations
  3. observation values from the CHD0 and CHD1 dataframes
  4. means for each predictor from the dataframe
  5. tau_prior
    • 2 see above
  6. mu_prior:
    • 0 per the question definition
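The two lists above can be sketched as one function. This uses the standard Normal-Gamma conjugate update, which I am assuming matches the slide up to notation; the parameter names follow the lists above, and the sample values are hypothetical, not rows from CHDdata.csv.

```python
def normal_gamma_posterior(data, mu_prior=0.0, tau_prior=2.0,
                           alpha_prior=0.001, beta_prior=0.001):
    """Normal-Gamma conjugate update for a Gaussian with unknown mean and precision."""
    n = len(data)
    xbar = sum(data) / n
    sum_sq = sum((x - xbar) ** 2 for x in data)  # sum of squared deviations from xbar
    tau_post = tau_prior + n
    mu_post = (tau_prior * mu_prior + n * xbar) / tau_post
    alpha_post = alpha_prior + n / 2.0
    beta_post = (beta_prior + 0.5 * sum_sq
                 + tau_prior * n * (xbar - mu_prior) ** 2 / (2.0 * tau_post))
    return mu_post, tau_post, alpha_post, beta_post

# Hypothetical standardized values for one predictor (illustration only).
sample = [-0.5, 0.1, 0.3, -0.2, 0.8, -0.4]
mu_post, tau_post, alpha_post, beta_post = normal_gamma_posterior(sample)

# The marginal posterior of the mean is a Student-t with 2 * alpha_post degrees
# of freedom, location mu_post, and squared scale beta_post / (alpha_post * tau_post).
scale = (beta_post / (alpha_post * tau_post)) ** 0.5
print("posterior mean:", mu_post, "posterior scale:", scale)
```

The posterior mean is a weighted average of mu_prior and the sample mean, with weights tau_prior and N, so tau_prior = 2 has little influence when N is 302 or 160.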

After I have both the posterior mean, and posterior scale I can compare these with the Gaussian Known Variance analysis. I am very curious about the results and comparison.

Now for CHD = 1 with Gaussian Unknown Mean Unknown Precision

The results from Gaussian Unknown Mean Unknown Precision are substantially different from the previous analysis.

I am confident this second analysis with unknown precision is correct, because with so few samples in our subpopulations I was very unsure of my data, but the differences are interesting for me to see:

Gaussian Unknown Mean Unknown Precision Means for Predictors in CHD = 0 & CHD = 1

CHD predictor   CHD = 0 posterior mu CI (width)   CHD = 1 posterior mu CI (width)   Contains 0 Y/N
sbp             -0.145 to -0.133 (0.012)          0.247 to 0.275 (0.028)            CHD=0: N, CHD=1: N
tobacco         -0.222 to -0.211 (0.011)          0.392 to 0.421 (0.029)            CHD=0: N, CHD=1: N
ldl             -0.196 to -0.184 (0.012)          0.344 to 0.37 (0.026)             CHD=0: N, CHD=1: N
adiposity       -0.19 to -0.177 (0.013)           0.333 to 0.355 (0.022)            CHD=0: N, CHD=1: N
typea           -0.081 to -0.068 (0.013)          0.127 to 0.152 (0.025)            CHD=0: N, CHD=1: N
obesity         -0.079 to -0.066 (0.013)          0.123 to 0.148 (0.025)            CHD=0: N, CHD=1: N
alcohol         -0.051 to -0.039 (0.012)          0.072 to 0.098 (0.026)            CHD=0: N, CHD=1: N
age             -0.276 to -0.263 (0.013)          0.497 to 0.514 (0.017)            CHD=0: N, CHD=1: N

Analysis of CI's from Gaussian Unknown Mean Unknown Precision:

The ranges for the CIs using the correct distribution are much smaller, and all means in the two sub populations appear significantly different; although typea, obesity, and alcohol are not far from 0 for either the CHD=0 or CHD=1 cohorts.

For 5 extra credit points

compute the probability of observing a point at least as extreme as the posterior mean of patients without coronary heart disease under the posterior distribution for the patients with coronary heart disease.

Then compute the probability of observing a point at least as extreme as the posterior mean of patients with coronary heart disease under the posterior distribution for the patients without coronary heart disease.

Answer: Using my univariate analysis, I will use the middle value of the 95% confidence interval:

It seems that alcohol could have values as extreme as the CHD=0 means when using the CHD=1 distributions, meaning that the two subpopulations (chd=0 and chd=1) may have similar rates of alcohol use, which does agree with the initial histograms I plotted for alcohol in chd=0, chd=1, and all.
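A rough way to reproduce this kind of comparison from the reported intervals alone is sketched below. It assumes each posterior is approximately normal, recovers the standard deviation from the 95% interval width, and uses the interval midpoint as the posterior mean; the intervals are the alcohol and age rows of the Gaussian Known Variance table, and the helper names are mine.

```python
import math

def summarize(ci_low, ci_high):
    """Midpoint and standard deviation recovered from a 95% interval, assuming normality."""
    return (ci_low + ci_high) / 2.0, (ci_high - ci_low) / (2.0 * 1.96)

def two_sided_tail(point, mean, sd):
    """P(|X - mean| >= |point - mean|) for X ~ Normal(mean, sd)."""
    z = abs(point - mean) / sd
    return math.erfc(z / math.sqrt(2.0))

# 95% intervals taken from the Gaussian Known Variance table.
alcohol_chd0 = summarize(-0.16, 0.07)
alcohol_chd1 = summarize(-0.07, 0.25)
age_chd0 = summarize(-0.38, -0.16)
age_chd1 = summarize(0.38, 0.64)

# Probability of a point at least as extreme as the CHD=0 mean under the CHD=1 posterior.
p_alcohol = two_sided_tail(alcohol_chd0[0], *alcohol_chd1)
p_age = two_sided_tail(age_chd0[0], *age_chd1)
print("alcohol:", p_alcohol, "age:", p_age)
```

Swapping the arguments gives the reverse comparison (the CHD=1 mean under the CHD=0 posterior).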

Question 6 (10 points)

See PowerPoint Attached to Assignment

Question 7 (15 points)

Using the Python Notebook https://www.kaggle.com/billbasener/pt2-probabilities-likelihoods-and-bayes-theorem, complete the challenge question from Section 6: Modify the code from Section 5 to add the ability to use the posterior from conjugate prior function to output the posterior probability parameters given parameters and for a Gaussian Likelihood with known variance σ^2, and use your modified function to create the Prior, Likelihood, Posterior plots as in Section 5 of the notebook.